Data Lake | ifkarsyah

Projects

Kadita — Config-Driven Data Ingestion Platform

A Kubernetes-inspired, YAML-configured data platform that ingests from Postgres, MySQL, MongoDB, Jira, Zendesk, and S3 into an Apache Iceberg data lake.

Data Lake Apache Iceberg S3 Python

Blog Posts

Nov 17, 2024

Iceberg Series, Part 6: Multi-Engine & Maintenance

Querying Iceberg from Trino, Flink, and DuckDB; expiring snapshots; rewriting data files; and keeping Iceberg tables healthy in production.

Data Lake Apache Iceberg

Nov 10, 2024

Iceberg Series, Part 5: Row-Level Operations

How MERGE, UPDATE, and DELETE work in Iceberg — copy-on-write vs merge-on-read, when to use each, and the performance trade-offs.

Data Lake Apache Iceberg

Nov 3, 2024

Iceberg Series, Part 4: Hidden Partitioning & Evolution

Partition transforms that derive partition values automatically, partition evolution that changes strategy without rewriting data, and why these are Iceberg's biggest ergonomic wins.

Data Lake Apache Iceberg

Oct 27, 2024

Iceberg Series, Part 3: Catalogs

How Hive, Glue, REST, and Nessie catalogs coordinate multi-engine access to Iceberg tables — and why the catalog abstraction is Iceberg's biggest differentiator.

Data Lake Apache Iceberg

Oct 20, 2024

Iceberg Series, Part 2: Table Format Internals

The four-layer metadata hierarchy — table metadata, manifest lists, manifest files, and data files — and how it enables efficient scans and snapshot isolation.

Data Lake Apache Iceberg

Oct 13, 2024

Iceberg Series, Part 1: Getting Started

Creating Iceberg tables with Spark, reading and writing data, running MERGE, using time travel, and inspecting table history.

Data Lake Apache Iceberg

Oct 6, 2024

Iceberg Series, Part 0: Overview

What Apache Iceberg is, how it differs from Delta Lake and Hudi, and why multi-engine interoperability is its defining advantage.

Data Lake Apache Iceberg

Sep 15, 2024

Delta Lake Series, Part 6: Streaming & CDC

Writing to Delta with Structured Streaming, exactly-once guarantees, reading Delta as a stream, and Change Data Feed for downstream propagation.

Data Lake Delta Lake Spark

Sep 8, 2024

Delta Lake Series, Part 5: Performance Optimization

Making Delta Lake queries fast — OPTIMIZE, Z-ordering, data skipping with column statistics, compaction, and partitioning strategies.

Data Lake Delta Lake Spark

Sep 1, 2024

Delta Lake Series, Part 4: Time Travel & Versioning

Querying historical snapshots by version or timestamp, rolling back bad writes, auditing the table history, and managing retention with VACUUM.

Data Lake Delta Lake Spark

Aug 25, 2024

Delta Lake Series, Part 3: Schema Enforcement & Evolution

How Delta Lake validates schemas on write, rejects incompatible data, and handles controlled schema changes over time.

Data Lake Delta Lake Spark

Aug 18, 2024

Delta Lake Series, Part 2: Transaction Log & ACID

How the Delta Lake transaction log enables atomicity, serializable isolation, optimistic concurrency, and conflict resolution.

Data Lake Delta Lake Spark

Aug 11, 2024

Delta Lake Series, Part 1: Getting Started

Creating Delta tables, reading and writing with Spark, Delta SQL, and what the _delta_log looks like in practice.

Data Lake Delta Lake Spark

Aug 4, 2024

Delta Lake Series, Part 0: Overview

The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.

Data Lake Delta Lake Spark
